[57] ABSTRACT

This invention provides a novel computer design that is 
capable of utilizing large numbers of very large scale 
integrated (VLSI) circuit chips as a basis for efficient 
high performance computation. This design is a static 
dataflow architecture of the type in which a plurality of 
dataflow processing elements communicate externally 
by means of input/output circuitry, and internally by 
means of packets sent through a routing network that 
implements a transmission path from any processing 
element to any other processing element. This design 
effects processing element transactions on data accord- 
ing to a distribution of instructions that is at most par- 
tially ordered. These instructions correspond to the 
nodes of a directed graph in which any pair of nodes 
connected by an arc corresponds to a predecessor- 
successor pair of instructions. Generally each predeces- 
sor instruction has one or more successor instructions, 
and each successor instruction has one or more prede- 
cessor instructions. In accordance with the present in- 
vention, these instructions include associations of exe- 
cution components and enable components identified 
by instruction indices. 

30 Claims, 17 Drawing Sheets 
function Quadratic (
    a, b, c: real)
  returns (complex, complex)
  type complex = record [re, im: real]
  let D := b * b - 4.0 * a * c;
      Y := 1 / (2.0 * a);
  in if D >= 0.0
     then
       let X := SqRt (D)
       in record [re: (-b + X) * Y; im: 0.0],
          record [re: (-b - X) * Y; im: 0.0]
       endlet
     else
       let X := SqRt (-D)
       in record [re: -b * Y; im: X * Y],
          record [re: -b * Y; im: -X * Y]
       endlet
     endif
  endlet
endfun

The Quadratic Formula Written In Val 



Fig. 3 
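
For readers without access to a Val implementation, the listing of FIG. 3 can be paraphrased in Python. This is only an illustrative sketch and not part of the disclosure: the function name `quadratic` is hypothetical, and Python's built-in `complex` type is assumed to stand in for the user defined Val record with fields re and im.

```python
import math
from typing import Tuple


def quadratic(a: float, b: float, c: float) -> Tuple[complex, complex]:
    """Return both roots of a*z**2 + b*z + c = 0 for real a, b, c (a != 0)."""
    d = b * b - 4.0 * a * c   # discriminant, D in the Val listing
    y = 1.0 / (2.0 * a)       # common factor, Y in the Val listing
    if d >= 0.0:
        # Two real roots, returned with a zero imaginary part.
        x = math.sqrt(d)
        return complex((-b + x) * y, 0.0), complex((-b - x) * y, 0.0)
    # Complex conjugate pair.
    x = math.sqrt(-d)
    return complex(-b * y, x * y), complex(-b * y, -x * y)
```

As in the Val text, every value is defined exactly once and then only read, which is why the flow of data from definition to use can be read directly from the program.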
function Smooth (
    Q: Grid;        % state values
    S: Grid;        % residuals
    D: Grid;        % Jacobian
    smu: real;      % smoothing parameter
    n: integer      % grid size
  returns Grid)     % smoothed data

  type Grid = array [array [array [real]]];

  let
    smi: real := .5 * smu;
  in
    forall j in [1, n], k in [1, n], l in [1, n]
    construct
      if j = 1 | k = 1 | l = 1 | j = n | k = n | l = n
      then % boundary point -- no change
        S[j, k, l]
      elseif j = 2 | j = n - 1
      then % point is next to boundary in j-direction
           % -- use second order formula
        S[j, k, l] + smi * (
          + Q[j + 1, k, l] * D[j + 1, k, l]
          - 2.0 * Q[j, k, l] * D[j, k, l]
          + Q[j - 1, k, l] * D[j - 1, k, l]
          ) / D[j, k, l]
      else % interior point -- use fourth order formula
        S[j, k, l] - smu * (
          + Q[j + 2, k, l] * D[j + 2, k, l]
          - 4.0 * Q[j + 1, k, l] * D[j + 1, k, l]
          + 6.0 * Q[j, k, l] * D[j, k, l]
          - 4.0 * Q[j - 1, k, l] * D[j - 1, k, l]
          + Q[j - 2, k, l] * D[j - 2, k, l]
          ) / D[j, k, l]
      endif
    endall;
  endlet
endfun

The Smooth Function Written in Val for One Physical
Quantity and for One Direction of Processing



Fig. 31 
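
The smoothing function of FIG. 31 can likewise be paraphrased in Python. The sketch below is an assumption-laden illustration rather than the patent's code: the name `smooth` and the helper `qd` are hypothetical, grids are assumed to be nested lists, and indices run from 0 to n - 1 instead of from 1 to n as in the Val listing.

```python
def smooth(Q, S, D, smu, n):
    """Smooth along the j-direction; Q, S, D are n x n x n nested lists."""
    smi = 0.5 * smu
    last = n - 1

    def qd(j, k, l):
        # Product Q*D that appears in every term of the Val listing.
        return Q[j][k][l] * D[j][k][l]

    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for l in range(n):
                if j in (0, last) or k in (0, last) or l in (0, last):
                    # Boundary point: no change.
                    out[j][k][l] = S[j][k][l]
                elif j in (1, last - 1):
                    # Next to the boundary in j: second order formula.
                    diff = qd(j + 1, k, l) - 2.0 * qd(j, k, l) + qd(j - 1, k, l)
                    out[j][k][l] = S[j][k][l] + smi * diff / D[j][k][l]
                else:
                    # Interior point: fourth order formula.
                    diff = (qd(j + 2, k, l) - 4.0 * qd(j + 1, k, l)
                            + 6.0 * qd(j, k, l)
                            - 4.0 * qd(j - 1, k, l) + qd(j - 2, k, l))
                    out[j][k][l] = S[j][k][l] - smu * diff / D[j][k][l]
    return out
```

Each output element depends only on the input grids, never on another output element, so all n*n*n elements may in principle be computed concurrently. The Python loops impose an order only because the sketch is sequential.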
DATAFLOW PROCESSING ELEMENT,
MULTIPROCESSOR, AND PROCESSES 

BACKGROUND OF THE INVENTION 

1. Field Of The Invention 

The present invention relates to parallel computer
architecture and processes, and, more particularly, to
dataflow processing elements, dataflow computers
comprising a plurality thereof, and dataflow processes.
Still more particularly, the present invention relates to
high performance numerical computing to which dataflow
computer architecture and techniques are particularly
applicable.

2. The Related Art

It is now generally recognized that efficient application
of modern integrated circuit technology to high performance
computation requires use of highly parallel computer
architectures, that is, machines having many processing
sites (processing elements) where stored instructions are
activated and executed. The usual form of a highly parallel
computer is a collection of a number of sequential
processing elements that intercommunicate either by sharing
access to a common global memory, or by sending messages to
one another by means of some form of interconnection
network. Computers with this architecture are limited in
the performance they can achieve in numerical computation
for several reasons, a principal one being the reduction of
performance involved in coordinating operations in the
several processing elements.

Dataflow computer architecture is a concept for the
organization of processing elements in which instructions
are activated promptly after the data values on which they
operate become available for use. In the dataflow computer
of the present invention, many dataflow processing elements
are interconnected by a packet routing network that allows
any processing element to send packets of information to
any other processing element. These processing elements are
capable of processing many instructions concurrently and
incorporate an efficient mechanism for indicating which
instructions are ready for execution. This structure
performs the synchronization functions for coordination of
concurrent activities, including the conditioning of
instruction activation in one processing element on the
arrival of a packet from another processing element. This
dataflow computer is capable of greater efficiency in
highly parallel computation than is possible with other
known parallel computer architectures.

In contrast with a conventional machine language program,
which corresponds to a sequentially executed array of
instructions, a dataflow machine program corresponds to a
directed graph in which each node represents a dataflow
instruction and each arc represents a data dependency of
one instruction on the result produced by another
instruction. Descriptions of the form and behavior of
dataflow programs use the terminology of the theory of
directed graphs, and are related to the system modeling
theory known as Petri nets.

The unconventional form of dataflow programs requires a
significant change in the way programs are written and
transformed into machine language. In place of languages
like Fortran, which are designed for efficient program
execution on sequential computers, functional programming
languages such as Val are better suited for programming
dataflow computers. Functional languages have the property
that the flow of data values from definition to use is
directly evident from the text and structure of the source
language program. In a language such as Fortran,
complicated analytical techniques are needed to identify
the flow of data. In some Fortran programs, this flow may
be impossible to determine.
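
The contrast drawn above, between a sequentially executed array of instructions and a directed graph in which an instruction is activated once its operands exist, can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the dictionary encoding `GRAPH`, the node names, and the function `run` are hypothetical and do not represent the machine format of the invention. The graph computes the discriminant b*b - 4.0*a*c.

```python
import random

# Hypothetical encoding: node name -> (operation, names of predecessor values).
GRAPH = {
    "bb":   (lambda b: b * b,          ["b"]),
    "ac4":  (lambda a, c: 4.0 * a * c, ["a", "c"]),
    "disc": (lambda x, y: x - y,       ["bb", "ac4"]),
}


def run(graph, inputs, seed=0):
    """Fire nodes in a data-driven order chosen at random among ready nodes."""
    rng = random.Random(seed)
    values = dict(inputs)      # values produced so far
    pending = set(graph)       # instructions that have not fired yet
    trace = []                 # order in which instructions fired
    while pending:
        # A node is ready once every value it depends on is available.
        ready = sorted(name for name in pending
                       if all(p in values for p in graph[name][1]))
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        name = rng.choice(ready)
        operation, predecessors = graph[name]
        values[name] = operation(*(values[p] for p in predecessors))
        pending.remove(name)
        trace.append(name)
    return values, trace
```

Running `run(GRAPH, {"a": 1.0, "b": -3.0, "c": 2.0}, seed=s)` for different seeds may fire "bb" and "ac4" in either order, but "disc" always fires last and always yields the same value, because the only ordering constraints are the data dependencies expressed by the arcs.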

REFERENCES 

The present invention is an advance beyond certain 
earlier dataflow concepts, which are disclosed and 
claimed in the following U.S. patents, and in which the 
present inventor is a joint or sole inventor: 

1. U.S. Pat. No. 3,962,706, dated June 8, 1976, for Data 
Processing Apparatus for Highly Parallel Execution 
computation. This design is a dataflow architecture of
the type in which a plurality of dataflow processing
elements communicate externally by means of input/output
circuitry, and internally by means of packets sent through
a routing network that implements a transmission path from
any processing element to any other processing element.
This design effects processing element transactions on data
according to a distribution of instructions that is at most
partially ordered. These instructions correspond to the
nodes of a directed graph in which any pair of nodes
connected by an arc corresponds to a predecessor-successor
pair of instructions. Generally each predecessor
instruction has one or more successor instructions, and
each successor instruction has one or more predecessor
instructions. In accordance with the present invention,
these instructions include associations of execution
components and enable components identified by instruction
indices.

A more specific object of the present invention is to
provide data processing means for effecting data processing
transactions according to programs of instructions, which
data processing means comprises: execution means for
effecting execution transactions and enable means for
effecting enable transactions, completions of execution
transactions causing transmission of done signals from the
execution means to the enable means, completions of enable
transactions causing transmission of fire signals from the
enable means to the execution means; the instructions
including execution specifiers having execution indices and
enable specifiers having enable indices, given associations
of execution indices and enable indices constituting given
instruction indices; the enable means being responsive to
sets of the done signals for performing enable transactions
according to the enable specifiers, the execution means
being responsive to sets of the fire signals to perform
execution transactions according to the execution
specifiers on sets of operators and operands; the
occurrence of a particular fire signal being conditioned on
the reception by the enable means of at least one of the
done signals, each of which corresponds to an enable
specifier that refers to the instruction index associated
with the particular fire signal.

A still more specific object of the present invention is
to provide data processing means for effecting processing
transactions according to programs of instructions
including directed predecessor-successor pairs of
instructions, a predecessor instruction having one or more
successor instructions, a successor instruction having one
or more predecessor instructions, the data processing means
comprising first means for transacting instruction
execution components specifying operands and operators to
produce execution results pertinent to further operands and
further operators, and execution completion signals
pertinent to these execution results; second means for
transacting instruction enable components specifying
sequencing to produce enable events pertinent to further
sequencing, and enable completion signals pertinent to
these enable events; associations of the execution
components and the enable components, and associations of
the execution completion signals and the enable completion
signals having associated indices corresponding to the
instruction indices; and third means for transmitting the
enable completion signals from the second means to the
first means, and for transmitting the execution completion
signals from the first means to the second means;
transaction of a given execution component and transaction
of a given enable component, which have associated indices
corresponding to one of the instruction indices, occurring
respectively in the first means and the second means;
transaction of a given successor instruction being
contingent on transmission of execution completion signals
and enable completion signals of the predecessor
instructions of the given successor instruction from the
first means to the second means and from second means to
the first means, by the third means. The structure of the
processing element is particularly adapted for use with a
routing network that is software switched, rather than
hardware switched. The arrangement is such that
transactions, including transactions between different
processing elements, occur at times that are arbitrary with
respect to a given order of the instructions, thereby
achieving a high percentage of peak performance.
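
The division of labor between the enable means and the execution means described in the two objects above can be pictured with a short Python sketch. It is a simplified model under stated assumptions, not the disclosed hardware: the class `EnableSystem`, the function `simulate`, the per-instruction count and reset values, and the first-in-first-out choice among enabled instructions are all hypothetical. Each instruction index has a count that is decremented by done signals; when the count reaches zero a fire signal is issued and the count is restored to its reset value.

```python
from collections import deque


class EnableSystem:
    """Hypothetical model: counts down done signals and issues fire signals."""

    def __init__(self, count, reset, distribution):
        self.count = dict(count)          # signals still awaited, per index
        self.reset = dict(reset)          # value restored when an index fires
        self.distribution = distribution  # index -> indices notified on done
        self.enabled = deque(i for i, c in self.count.items() if c == 0)

    def next_fire(self):
        """Return the index of the next fire signal, or None if none is enabled."""
        if not self.enabled:
            return None
        index = self.enabled.popleft()
        self.count[index] = self.reset[index]
        return index

    def done(self, index):
        """Distribute a done signal to the instructions that depend on index."""
        for target in self.distribution[index]:
            self.count[target] -= 1
            if self.count[target] == 0:
                self.enabled.append(target)


def simulate(enable, execute, max_firings):
    """Alternate fire and done signals; return the order of instruction firings."""
    trace = []
    for _ in range(max_firings):
        index = enable.next_fire()
        if index is None:
            break
        execute(index)       # stands in for the execution transaction
        enable.done(index)   # completion is reported back as a done signal
        trace.append(index)
    return trace


# Three-instruction chain 0 -> 1 -> 2. A done signal goes to the successor
# (data is available) and to the predecessor (the value has been consumed).
chain = EnableSystem(
    count={0: 0, 1: 1, 2: 1},
    reset={0: 1, 1: 2, 2: 1},
    distribution={0: [1], 1: [2, 0], 2: [1]},
)
```

Calling `simulate(chain, lambda i: None, 6)` fires the instructions in the order 0, 1, 2, 0, 1, 2: instruction 0 cannot fire a second time until instruction 1 has reported completion, which is the conditioning of a fire signal on done signals that the text describes.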

Other objects will in part be obvious and will in part 
appear hereinafter. 

BRIEF DESCRIPTION OF THE DRAWINGS 

For a fuller understanding of the nature and objects 
of the present invention, reference is made to the fol- 
lowing description, which is to be taken in connection 
with the accompanying drawings wherein: 

FIG. 1 illustrates a dataflow processing element em- 
bodying the present invention; 

FIG. 2 illustrates a dataflow computer comprising a 
plurality of dataflow processing elements of the type 
shown in FIG. 1; 

FIG. 3 illustrates certain programming principles of 
Val, a functional language that is supported by the data- 
flow computer of FIG. 2; 

FIG. 4 illustrates dataflow graph principles that cor- 
respond to the programming principles of FIG. 3; 

FIG. 5 illustrates details of an execution system and 
an enable system that are components of the processing 
element of FIG. 1; 

FIG. 6 illustrates the instruction execution component
format of a processing element of the present invention;

FIGS. 7 to 16 illustrate details of an instruction set for 
the execution system of FIG. 1, in the context of the 
computer of FIG. 2; 

FIGS. 17 to 23 illustrate control word formats relat- 
ing to operation of the enable system of FIG. 1, in the 
context of the computer of FIG. 2; 

FIG. 24 illustrates dataflow instructions arranged for 
pipelined execution in accordance with the present 
invention; 

FIG. 25 illustrates 2x2 routers connected to form a
packet routing network for the illustrated embodiment
of the computer of FIG. 2;

FIG. 26 illustrates interprocessor communication in 
the computer of FIG. 2 in accordance with the present 
invention; 

FIG. 27 illustrates dataflow machine code for the 
quadratic formula as an example in accordance with the 
present invention; 

FIG. 28 illustrates result queueing in accordance with 
the present invention; 

FIG. 29 illustrates alternative representations of array 
values in accordance with the present invention; 

FIG. 30 illustrates the phases of smoothing operations 
in connection with use of array memories in accordance 
with the present invention; 

FIG. 31 illustrates a data smoothing function written 
in the Val programming language for one physical 
quantity and one direction of processing; 
FIGS. 32 and 33 illustrate dataflow machine code for
the smoothing function of FIG. 31;

FIG. 34 illustrates further principles of result queueing
in the present invention;

FIG. 35 illustrates Petri net principles as applied to
program analysis for the present invention; and

FIG. 36 illustrates cycles corresponding to certain
dependency relationships in FIGS. 32 and 33.

DESCRIPTION OF THE PREFERRED EMBODIMENT

(I) The Model of Computation for the Architecture of
FIGS. 1 and 2 as Exemplified in FIGS. 3 and 4

Dataflow architecture makes possible a major change in the
way programs are expressed. The preferred language for use
with the present architecture is Val, a language developed
at the Massachusetts Institute of Technology, substantially
as described in Ref. 2. Val is a functional programming
language that is designed to allow easy determination of
program parts that may be executed simultaneously. Although
various features of [...] architecture is capable of
supporting other languages, for example, Fortran,
particularly if constrained by appropriate rules for
ensuring identification and implementation of parallelism.
In the program examples shown below, data type declarations
have been omitted in order to maintain simplicity.

In thinking about computer systems and how they may be
programmed, it is important to have a model of computation
as a guide. For conventional computers, the model of the
store (or address space) and the program counter selecting
successive instructions for execution is fundamental. For
dataflow computers of the type illustrated in FIGS. 1 and
2, a new model of computation is required. To illustrate
this model, consider the well known quadratic formula for
the roots of a second order algebraic equation with real
coefficients

$$ a z^2 + b z + c = 0 $$

The complex roots of the quadratic are given by

$$ z = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} $$

A program expressed in Val to compute the two complex root
values is shown in FIG. 3. Note that the complex values are
represented by a user defined record type having fields for
the real and imaginary components. A dataflow graph for
this computation is given in FIG. 4. This graph consists of
nodes (or actors) 100 connected by arcs (or links) 102,
with tokens (or packets) 104 being carried by the arcs and
being consumed by the nodes to indicate the flow of
information from predecessor nodes to successor nodes.
Although the terms "tokens" and "packets" are sometimes
used interchangeably, "tokens" is used in discussing the
abstract dataflow model, and "packets" is used in
discussing implementations of the model. The nodes respond
to the presence of tokens on their input arcs by "firing",
that is, applying nodes to ready (entry) tokens on input
arcs to produce result (derived) tokens on output arcs.
Because a node can fire whenever it has ready tokens on its
input arcs and token-free output arcs, dataflow graphs can
be configured in stages in the same manner as conventional
pipelined functional units. In the static dataflow design
of the present embodiment, an arc may carry no more than
one token. The significance of the dataflow model regarding
the present invention is that the nodes correspond to
instructions, the arcs to functional dependencies between
the instructions in pairs, and the tokens to information
generated by predecessor instructions for processing by
successor instructions.

As shown in FIG. 4, self-explanatory nodes that represent
constants and arithmetic operators are shown as circles
with such labels as ×2, recip, neg, ×, ×4, −, √, +, 0. The
diamond shaped node 106 with the label ≥0 is a decider that
performs a test on its data input and sends the resulting
boolean truth value to its successors. The circular nodes
inscribed with T or F are gates that pass a data value if
and only if the boolean input matches the label. At the
bottom of the graph are four merge nodes, shown as capsules
108 labelled T F, which pass one value from the specified
data input for each boolean token received. Except for the
merge nodes, all nodes behave according to the same simple
firing rule: a node is enabled if and only if a token is
present on each of its input arcs and there is no token on
any of its output arcs. The merge node is special; it is
enabled only when a boolean token is present and a data
token is present at the input corresponding to the truth
value carried by the boolean token (the output arc must be
empty). Although examples of program graphs and their Val
language equivalents are given herein for completeness, it
is not necessary to follow these examples in detail in
order to understand the design that is the primary subject
of this disclosure.

It can be shown that graphs of the illustrated type exhibit
the behavior necessary for generating one set of output
data derived for each set of input data entered.
Furthermore, the final configuration of tokens in the graph
and the values they represent is independent of the order
in which enabled nodes are fired. This property, which is
called determinacy, makes it possible for a functional
programming language such as Val to express parallelism
without suffering from the timing hazards and races that
make the development and debugging of conventional
multiprocessor programs very challenging.

The dataflow graph computational model is a useful
representation of the meaning of parallel programs and
their hardware implementations. The essence of the present
invention is in an improved representation that results in
especially efficient design for corresponding hardware
implementations. These are discussed in the following
sections.

(II) The Illustrated Architecture-FIGS. 1 and 2

As indicated above, the present architecture contemplates a
dataflow processing element for effecting processing
element transactions on data according to programs of
particularly related instructions. These instructions are
such that directed predecessor-successor pairs of
instructions are related by functional dependencies, a
predecessor instruction having one or more successor
instructions, and a successor instruction having one or
more predecessor instructions.

As shown at 110, the illustrated processing element
comprises an execution system 112 and an enable system 114
having components that will be described in detail in
connection with FIG. 5. Generally, the components of the
execution system include execution memory,
arithmetic/logic, and execution control circuitry; and
the components of the enable system include enable memory
and enable control circuitry, of which the enable memory
includes accumulation memory and distribution memory
circuitry. Execution system 112 and enable system 114 are
operatively connected for communication therebetween of
fire commands 116 and done commands 118. The instructions
include execution components which are processed by
execution system 112 and enable components which are
processed by enable system 114, each instruction being
identified by a unique instruction index. Fire commands 116
and done commands 118 include instruction index specifiers.

Generally, the fields of the instruction execution
components include opcodes and, optionally, data constants
and/or input data address specifiers and output data
address specifiers. In other words, an instruction
execution component typically includes an opcode,
optionally a data constant, one or more input data address
specifiers, and one or more output data address specifiers.
The input data include operand values and the output data
include result values and/or result conditions. Execution
system 112 stores instruction execution components at
memory locations corresponding to instruction index
specifiers and data at memory locations corresponding to
certain data address specifiers. An instruction execution
transaction involving the arithmetic/logic circuitry
includes retrieval of an instruction execution component
and corresponding input data from memory locations in
execution system 112, processing of an instruction
execution component by the arithmetic/logic circuitry to
produce output data, and storage of output data in memory
locations of the execution system 112. The execution system
is responsive to a fire command 116 to cause an instruction
execution transaction, and, after its completion, with or
without indication of a result condition, to cause
generation of a done command 118.

Generally, the fields of the instruction enable components
include control fields for indicating the enabled status of
instructions, and lists of the indices of enable components
of predecessor and successor instructions together with
code governing interpretation of instruction completion
codes that accompany the done commands for certain
instructions. The control fields of a given instruction
typically include a count field and a reset field. The
reset field is fixed during program execution. The count
field changes during program execution to reflect
completion of selected successor and predecessor
instructions. Enable system 114 stores instruction enable
components at memory locations corresponding to instruction
indices. An instruction enable transaction includes
accumulating a count for a given count field until an
enable count is indicated, transmitting a fire command to
execution system 112, receiving a done command from
execution system 112, and distributing signals representing
the done command to the indicators of the successor and
predecessor instructions.

The multiprocessor of FIG. 2 embodies a plurality of
processing elements 110 and a plurality of array memories
120, one processing element directly addressing only one
array memory. The processing elements communicate
internally via paths 122 through a routing network 124 and
externally via paths 126 through an input/output network
128. The multiprocessor is controlled by a monitor
controller 130.

A processing element transaction includes generation of a
fire command following occurrence of an enable condition,
an arithmetic/logic transaction to generate a done command,
and recordation of this transaction in the indicator
memory. The arrangement is such that processing element
transactions, in any processing element and between
processing elements, occur at times that are arbitrary with
respect to a given order of the instructions, thereby
achieving a high percentage of peak performance.

(III) The Execution System-FIG. 5

As shown in FIG. 5, execution system 112 includes
functional units 132, 134 as a functional system and
switching circuitry 136. The functional system performs
arithmetic and/or logic operations on operand values to
produce result values. In addition to result values, the
functional units produce completion codes which indicate
the outcomes of various tests on the result values, e.g.,
zero/non-zero, negative/positive. Functional unit 132
performs floating point multiplication. Functional unit 134
performs floating point addition and fixed point
arithmetic. Functional units 132, 134 are pipelined, each
consisting of several logic stages separated by memory
latches, and arranged so that, whenever an instruction is
enabled, the arguments of its operation, say A and B, are
fed into the first stage of the functional unit and at some
later time the result R is presented by the final stage.
The control is arranged in such a way that several sets of
arguments may be entered into the functional unit before an
earlier result is delivered. Such functional units are sold
by Weitek under the trade designations WTL 1064 and WTL
1065 and by Analog Devices under the trade designations
ADSP-3210 and ADSP-3220. Routing circuitry 136 is a gate
array for multiplexing and demultiplexing.

Execution system 112 includes also an execution instruction
memory 138, an execution control memory 140, four data
memories (DM) 142, 144, 146, 148, and an execution system
controller 150. Each of the instruction execution
components, all of which are stored in execution
instruction memory 138, includes at least one opcode,
references to one or two operand values (constants and/or
operand address specifiers), and a result address specifier
corresponding to the destined location of the result value.
Data memories 142, 144, 146, 148 store operand values and
result values so referenced according to execution
instructions being currently applied to functional system
132, 134. The architecture provides indexed addressing for
the data memories and array memories and incrementing to
establish modes of operand fetch and result store suitable
for generating FIFO queues in the data memories and array
memory. The storage locations of these operand values and
result values correspond, respectively, in a dataflow
graph, to tokens on the input arcs and the output arcs of a
firing node. Execution control memory 140 stores control
words for indirect addressing and control blocks for
first-in-first-out (FIFO) queues allocated in the data
memories and the array memory.

Execution system controller 150 mediates execution
transactions as directed by its interaction with the enable
system. A signal via fire path 116 is triggered when an
enable condition has occurred, and a done signal via done
path 118 is triggered when the execution system has
completed an execution transaction. Following receipt of
fire signal 116 by execution controller 150, the opcode and
operand values corresponding to the operand specifiers are
accessed and a result value corresponding to the result
specifier is assigned by functional system 132, 134. An
operand specifier or a result speci-
fier may variously refer to a location within the data
memory, the array memories or the control memory.

As shown: execution instruction memory 138 is a static
random access memory (SRAM) of 4K by 64 bits, that is
read-only except during program loading; control memory 140
is a 4K by 16 bit SRAM; each data memory 142, 144, 146, 148
is a 16K by 32 bit SRAM; each array memory 120 is a dynamic
random access memory with 1 megaword of storage and is
capable of holding large blocks of data such as numerical
matrices. As shown: execution system controller 150 and
execution instruction memory 138 communicate via address
path 159 and data path 157; execution system controller 150
and execution control memory 140 communicate via a data
path 161 and an address path 163; execution system
controller 150 and data memories 142, 144, 146, 148
communicate via address paths 165, 167, 169, 171; data
memories 142, 144, 146, 148 and multiplex/demultiplex
circuit 136 communicate via data paths 173, 175, 177, 179;
multiplex/demultiplex circuit 136 and arithmetic/logic
units 132, 134 communicate via data paths 181, 183, 185;
and multiplex/demultiplex circuit and execution system
controller 150 communicate via data path 187.

(IV) The Enable System-FIG. 5

As shown in FIG. 5, enable system 114 comprises an
indicator memory 152, a distribution memory 154, an enable
system controller 156, and a priority selector 158.
Indicator memory 152 includes indicator registers 153, 155
which respectively, for a given instruction, provide a
current enable count that records completions of executions
of related predecessor and successor instructions
optionally depending on related result conditions, and a
reset enable count that restores the registers to reset
condition when the given instruction has been selected for
execution. The current enable count is a dynamically
varying quantity that represents the total number of
signals execution system 112 must receive in reference to a
given instruction between a given time and the time it
becomes enabled. By definition, an instruction is enabled
when its current enable count becomes 0. When an
instruction cell is fired, its current enable count is
reset to a reset enable count. The reset enable count is
the total number of signals an instruction cell must
receive since its last firing before becoming enabled
again. In relation to the corresponding dataflow graph, the
reset enable count is the number of nodes that must fire
for each instance of firing of the node corresponding to
the given instruction. Distribution memory 154 stores lists
of instruction indices and, optionally, conditional
instructions, the list of instruction indices of a given
instruction corresponding to the set of given successor and
predecessor dependencies. In its various forms, the list of
an enable instruction takes such forms as a simple list of
instruction identifiers, or [...]

[...] condition that is required before that given node can
fire again. Since, as will be explained below, a
predecessor node cannot detect when a token it has output
has been consumed by its successor node, the successor node
must explicitly signal the predecessor node when it has
consumed the token. The done signal corresponds to what may
be thought of as an implicit acknowledgement arc directed
from a successor node to a predecessor node. In the form
shown: enable instruction memory 154 is a static random
access memory; enable controller 156 is a custom gate
array; and priority selector 158 is a "fairness" circuit
that ensures execution of any enabled instruction within a
predetermined elapsed time following its enablement, i.e. a
predetermined number of timing cycles following the time
its enable count becomes 0. Priority selector 158 and the
registers of indicator memory 152, including 6 bits for
each of 4096 instructions, are realized as a full custom or
semi-custom integrated circuit, on one chip in a first
embodiment, and on several chips in a second embodiment. As
shown: enable priority selector 158 and count registers 153
communicate via data path 189; count registers 153 and
enable system controller 156 communicate via data path 193
and address path 203; count registers 153 and reset
registers 155 communicate via address path 191; count
registers and reset registers 153, 155 and enable system
controller 156 communicate via address path 197 and data
path 195; and enable system controller 156 and distribution
memory 154 communicate through address path 199, and data
path 201.

(V) The Supercomputer of FIGS. 2 and 8

A powerful dataflow supercomputer is constructed by
connecting many dataflow processing elements as illustrated
in FIG. 2. In the preferred embodiment, the interconnection
network is a packet routing network that delivers
information packets sent by any source processing element
to any specified destination processing element. The packet
routing network shown in FIG. 25, which has N input and N
output ports, may be assembled from (N/2)log2(N) units,
each of which is a 2x2 router. A 2x2 router receives
packets at two input ports and transmits each received
packet at one of its output ports according to an address
bit contained in the packet. Packets are handled
first-come-first-serve, and both output ports may be active
concurrently. Delay through an NxN network increases as
log2(N), and capacity rises nearly linearly with N.

A dataflow program to be executed on this computer is
divided into parts that are allocated to the several
processing elements. Monitor processors support the
translation of programs from a high level language such as
Val, and the construction of dataflow machine language
programs. The monitor processors also direct the loading of
machine language programs into the dataflow processing
elements and monitor and control their
conditional list (e.g., "if the result is zero, signal instruc- execution. The manner in which the parts of programs 
tion cell Z, otherwise signal cell J"). Priority selector located in the several processing elements communicate 
158 arranges the transmission of fire signals through fire through the routing network will be described below in 
path 116 from enable system 114 to execution system 60 connection with FIG. 26. 

112. Enable indicator 152, enable instruction memory In one form, this supercomputer, which includes 256 

154, and priority selector 158 are under the control of dataflow processing elements of the aforementioned 
enable system controller 156, which receives done sig- type, is designed to perform of the order of 1 billion 
nals through done path 118 as instruction execution floating point operations per second. To meet the re- 
transactions are completed. A done signal corresponds 65 quirements of scientific computation, arithmetic units 
to an invisible return arc between a predecessor node 52, 54 support high-precision (64-bit) numbers. To 
and a successor node, by which the relevant output arc achieve this level of performance, it is necessary, as in 
of the predecessor node is known to be token free, a the present embodiment, that the floating point func- 
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tional units be closely coupled to data memories 142, 
144, 146, 148, which hold numerical values between 
program executions. 
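The sizing rule stated above for the packet routing network, (N/2)log2(N) routers with delay growing as log2(N), can be illustrated with a short calculation. This is only a sketch: the function name `routing_network_size` is hypothetical, and N is assumed to be a power of two, which the text implies but does not state.

```python
def routing_network_size(n_ports):
    """Return (stages, routers) for an N x N network built from 2x2 routers.

    Assumes n_ports is a power of two (an assumption, not stated in the text).
    """
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("n_ports must be a power of two >= 2")
    stages = n_ports.bit_length() - 1        # log2(N): one address bit per stage
    routers = (n_ports // 2) * stages        # (N/2) * log2(N)
    return stages, routers


if __name__ == "__main__":
    for n in (2, 16, 256):
        stages, routers = routing_network_size(n)
        print(f"N={n}: {stages} stages, {routers} routers")
```

For the 256-element form of the machine this gives 8 router stages on every path and 1024 routers in all.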

(VI) Further Details and General Operation 
As has been indicated and as will be described in 
greater detail below, the instruction format for the pro- 
cessing element of the present invention has two com- 
ponents, one of which is used by execution system 112 
to direct instruction execution, and the other of which is 
used by enable system 114 to control the enabling of 
other instructions. The former component uses three- 
address and two-address instruction formats and speci- 
fies an operation which takes operand values from and 
stores a result value in the data memory. The enable 
component includes a signal list, an enable count and a 
reset count. The signal list is a specification of which 
instructions are to be sent signals upon completion of 
instruction execution. 

The enable system operates as follows. Whenever a signal is sent to
instruction i, its enable count is decremented by one. When the enable
count becomes zero, instruction i is marked as "enabled". Using a
"fair" selection rule, the enable system sends fire commands to the
execution system for instructions marked "enabled." Thus, an
instruction will consist of an opcode, one or two source specifiers, a
result specifier, and a signal list. The two components of the
instruction may be thought of as: the portion that specifies the
operation and data to be operated upon; and the portion that concerns
the sequencing of instruction execution. The following material
presents one of the possible designs for the instruction set of a
dataflow computer that embodies the disclosed invention. Section VII
below describes the execution component of the instruction. Section
VIII below describes the enable component of the instruction.
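The enabling rule just described (decrement on signal, mark enabled at zero, restore the reset count on firing, signal the successors on completion) can be sketched in a few lines. This is a simplified model, not the disclosed hardware: the class and method names are hypothetical, signal lists are plain Python lists, and a first-in-first-out queue stands in for the "fair" selection rule.

```python
from collections import deque


class EnableSystem:
    """Toy model of enable counts, reset counts and signal lists."""

    def __init__(self, reset_counts, signal_lists):
        self.reset = dict(reset_counts)      # reset enable count per instruction
        self.count = dict(reset_counts)      # current enable count per instruction
        self.signal_lists = signal_lists     # instruction -> instructions to signal
        self.enabled = deque()               # FIFO order stands in for "fair"

    def signal(self, i):
        # A signal decrements the enable count; zero marks the instruction enabled.
        self.count[i] -= 1
        if self.count[i] == 0:
            self.enabled.append(i)

    def next_fire(self):
        # Select an enabled instruction and restore its count to the reset value.
        if not self.enabled:
            return None
        i = self.enabled.popleft()
        self.count[i] = self.reset[i]
        return i

    def done(self, i):
        # On completion, signal every instruction on the signal list of i.
        for target in self.signal_lists.get(i, ()):
            self.signal(target)


def run(system, start_signals):
    """Deliver the start signals, then fire until nothing is enabled."""
    for i in start_signals:
        system.signal(i)
    order = []
    while (i := system.next_fire()) is not None:
        order.append(i)
        system.done(i)
    return order


if __name__ == "__main__":
    es = EnableSystem({"a": 1, "b": 1, "c": 2}, {"a": ["c"], "b": ["c"]})
    print(run(es, ["a", "b"]))   # c fires only after both a and b are done
```

Instruction c has a reset count of 2, so it fires only after both of its predecessors have completed, and its count is back at 2 afterwards.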

(VII) Instruction Set and Operation— Execution 
Component 

The general formats for the execution components of dataflow
instructions are shown in FIG. 6. The execution format has two forms: a
form with opcode, two operand specifiers and a result specifier, and a
form with opcode, one operand specifier and a result specifier. The
second form is distinguished by a special code in the unused first
field.

(a) Instruction Format

The nomenclature is as follows:

opcode:        operation code
modes:         addressing modes (2 bits per specifier)
br:            break (for program testing)
oper-spec:     operand source specification
a-oper-spec:   a-operand source specification
b-oper-spec:   b-operand source specification
result-spec:   result specification

(b) Data Types

The data types supported by the hardware use four sizes of binary word
with letter designations as follows:

(1) L-long (64 bits)

(2) S-short (32 bits)

(3) H-half (16 bits)

(4) B-byte (8 bits)

These data are interpreted as the following different data types:

(1) Long floating point numbers (LR): represented in accordance with
the IEEE floating point standard with adjustments to accommodate the
Val error values pos-over, pos-under, neg-under, neg-over, unknown,
undef, and zero-divide.

(2) Short floating point numbers (SR): represented in accordance with
the IEEE floating point standard with adjustments to accommodate the
Val error values pos-over, pos-under, neg-under, neg-over, unknown,
undef, and zero-divide.

(3) Fixed point binary numbers (SF): consisting of an error flag, a
sign bit and a 30 bit magnitude field (in two's complement form). If
the error flag bit is one, the remaining bits contain an encoding of
the Val error values pos-over, neg-over, unknown, undef, and
zero-divide.

(4) Bit strings (LB, SB, HB, BB): uninterpreted bit strings of length
64, 32, 16 or 8.

(5) Binary numbers (HI, BI): Binary integers in two's complement form
(16 and 8 bits) with no error codes.

(6) Characters (BC): Symbols in the ASCII character set represented as
8-bit bytes.

(c) Identity Operations

Lid:         Long Identity          LB → LB
Sid:         Short Identity         SB → SB

(d) Operations for Real Arithmetic

LMul:        Long Real Multiply     LR × LR → LR
LAdd:        Long Real Add          LR × LR → LR
LSub:        Long Real Subtract     LR × LR → LR
LCmp:        Long Real Compare      LR × LR → ()
SMul:        Short Real Multiply    SR × SR → SR
SAdd:        Short Real Add         SR × SR → SR
SSub:        Short Real Subtract    SR × SR → SR
SCmp:        Short Real Compare     SR × SR → ()

(Support for Real Divide and Square Root)

(e) Operations for Fixed Point Arithmetic

FAdd:        Fixed Add              SF × SF → SF
FSub:        Fixed Subtract         SF × SF → SF
FMult:       Fixed Multiply         SF × SF → SF × SF
FMultHigh:   Fixed Fraction Mult.   SF × SF → SF
FMultLow:    Fixed Integer Mult.    SF × SF → SF
FCmp:        Fixed Compare          SF × SF → ()

(Support for Fixed Divide and Modulo)



(f) Arithmetic Shifts

FSHL:        Fixed Shift Left       SF × SF → SF
FSHR:        Fixed Shift Right      SF × SF → SF
LSHL:        Long Shift Left        LF × SF → LF
LSHR:        Long Shift Right       LF × SF → LF



(g) Type Conversion Operations

LFix:        Long Real to Fixed             LR → SF
LMan:        Long Real Extract Mantissa     LR → SF
LExp:        Long Real Extract Exponent     LR → SF
SFix:        Short Real to Fixed            SR → SF
SMan:        Short Real Extract Mantissa    SR → SF
SExp:        Short Real Extract Exponent    SR → SF
LFloat:      Fixed to Long Real             SF → LR
SFloat:      Fixed to Short Real            SF → SR

Note: Other conversions, with truncation and extension of operand and
result values, are performed during instruction execution, specifically
during operand accessing and result storing.



(h) Logical Operations

And          Short Logic And             SB × SB → SB
Or           Short Logic Or              SB × SB → SB
Xor          Short Logic Exclusive Or    SB × SB → SB
LLCycle      Long Left Cycle             LB × SF → LB
LRCycle      Long Right Cycle            LB × SF → LB
SLCycle      Short Left Cycle            SB × SF → SB
SRCycle      Short Right Cycle           SB × SF → SB
Set          Set Bit                     SB × SF → SB
Test         Test Bit                    SB × SF → ()
(i) Addressing Modes

The mode field of an instruction specifies the addressing rule for each
operand and the result. The basic modes are Normal (N), Constant (C),
Indexed (I), and Reference (R), of which Indexed has two options and
Reference has six options. These modes may be combined subject to the
following restrictions:

(a) In Indexed mode the second specifier gives the index value. Thus,
Indexed mode does not apply to dyadic (two-operand) operators.

(b) Indexed mode may be specified only for the first operand specifier
and/or the result specifier.

(c) The Reference mode options may be applied to the result specifier
and/or just one operand specifier.

(d) Constant data mode applies only to operand specifiers.

The nomenclature is as follows:



      Mode Option
      Increment width
du    Data unit size

In Indexed and Reference modes, the data unit field specifies the size
of the data unit fetched or stored. Type conversions to and from the
types required by the operation code are done (by truncation, padding
with zeros or ones, sign extension).

data unit field (du)    data unit size
00                      8 bits
01                      16 bits
10                      32 bits
11                      64 bits

(1) Normal Mode (N)

As shown in FIG. 7, the specifier is a D-Mem address for fetching an
operand or storing a result.

(2) Constant Mode (C)

As shown in FIG. 8, in the Constant Mode, the specifier is interpreted
as an abbreviation for one of a set of values to be used as an operand.
It is converted to the number type required by the operation code.

(3) Indexed Mode in Data Memory (ID)

As shown in FIG. 9, the specifier gives the location of a control word
in C-Mem. The operand is fetched from D-Mem at the address computed by
adding the index value (taken modulo 2^w) to the D-Mem base address
(which is padded as necessary). The index comes from the second
specifier of the instruction, which cannot use Indexed Mode.

(4) Indexed Mode in Array Memory (IA)

As shown in FIG. 10, the Indexed Mode in the array memory is analogous
to the Indexed Mode in the data memory. The base address is the offset
added to the (padded) A-Mem base address. Note that the w field here is
twice as large as the w field in ID mode.

(5) Reference Mode (R)

In the Reference Mode, the specifier gives the location of a control
block in C-Mem. There are six cases distinguished by the mode option
field in the specifier format: CM, UDQ, UAQ, CAQ, PSR, IOT. These are
described as follows.

(6) Control Memory Read/Write (CM)

As shown in FIG. 11, the specifier contains the address of a location
in the control memory, which serves as the source for an operand or the
destination for a result.

(7) Uncontrolled Queue in Data Memory (UDQ)

As shown in FIG. 12, this is used for implementing small FIFO queues
where programming conventions are used to prevent FIFO over/underflow.
The data memory address is used as the operand address. Then the lowest
w bits are incremented modulo 2^w and the new address is written back
into the control memory.

(8) Uncontrolled Queue in Array Memory (UAQ)

As shown in FIG. 13, this mode is analogous to the UDQ mode, except
that here the operand address is the array memory base address plus the
index, and the index is the part that is incremented.

(9) Controlled Queue in Array Memory (CAQ)

As shown in FIG. 14, pointers are kept to both ends of the queue. The
in-index or the out-index is added to the A-Mem base address and
incremented (modulo 2^w) depending on whether this specifier is an
operand-spec or a result-spec.



(10) Packet Send/Receive (PSR)

FIG. 15 illustrates the specifier and control block formats used when a
data packet is to be transferred between two processing elements of the
computer of FIG. 2. The structure for implementing this transfer is
described below in connection with FIG. 26.

(11) Input/Output Transfer (IOT)

FIG. 16 illustrates the specifier, control block, and I/O buffer table
formats for input and output transactions over 16 I/O channels. A
specifier relates to an input or output transaction in an indicated
channel according to whether it designates an operand or a result. The
mode option is only two bits wide. Therefore, the two bits used to
indicate IOT mode cannot be the first two bits of any other mode
option, which is acceptable since only 6 options are needed. If the
specifier is an operand-spec, then the buffer memory entry for the
indicated channel is read, and the instruction specified by the
acknowledge field is signaled when the buffer has been re-filled. If
the specifier is a result-spec, then the buffer memory entry is written
to, and the instruction specified is signaled when the buffer has been
emptied. The channel buffer memory is part of the instruction execution
control system and contains all necessary information to preclude any
delay in the operation of the execution system pending completion of an
I/O wait.
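The pointer update used by the Uncontrolled Queue in Data Memory (UDQ) mode, in which the lowest w bits of the stored address are incremented modulo 2^w, can be sketched as follows. The function name `udq_access` is hypothetical, and the control word is modeled as a bare address; the actual control word layout of FIG. 12 is not reproduced here.

```python
def udq_access(address, w):
    """Return (operand_address, new_address) for one UDQ access.

    The stored address is used as the operand address, then its lowest
    w bits are incremented modulo 2**w; the upper bits are unchanged, so
    the queue wraps around inside a block of 2**w locations.
    """
    mask = (1 << w) - 1
    operand_address = address
    low_bits = (address + 1) & mask               # increment modulo 2**w
    new_address = (address & ~mask) | low_bits    # keep the upper bits
    return operand_address, new_address


if __name__ == "__main__":
    addr = 0b10110110
    for _ in range(3):
        used, addr = udq_access(addr, w=3)
        print(f"access {used:#010b}, write back {addr:#010b}")
```

With w = 3 the pointer cycles through eight consecutive locations and wraps back to the start of the same block, which is why the text relies on programming conventions rather than hardware to prevent overflow and underflow.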

(VIII) Instruction Set and Operation— Enable 
Component 

The encoding of the enable portion of the instructions will now be
described. The enable portion of an instruction includes a signal list
field, a count field, and a reset field. The signal lists are stored in
distribution memory 154; the count and reset fields are held in 3-bit
count registers 153 and 3-bit reset registers 155, which together
constitute an indicator memory 152. In one form, the distribution
memory consists of 16-bit words and includes a base or header region
having one storage location for each dataflow instruction, and an
extension region containing blocks of successive words that hold the
remainder of signal lists for which one word is insufficient. The
formats for the words making up signal lists are given in FIGS. 17
through 23 and described below. Each word may be thought of as a
specialized instruction to the signal system controller which either
specifies a dataflow instruction to be signalled or specifies where to
find and how to process the remaining words of the signal list. In the
word formats, "instruction index" means the index of an instruction to
be signalled, "jump address" means the address in the distribution
memory of the next signal list word to be interpreted, "instruction
reference" means a 7-bit field that specifies an instruction index
relative to the "current base", and "condition" means a 4-bit field
that specifies a boolean test on the (optional) condition code that
accompanies the done signal of a dataflow instruction.

The signal list word formats are as follows:

(1) Normal Signal

As shown in FIG. 17, a Normal Signal word causes the instruction
specified by the instruction index to be signalled. The 1-bit end field
indicates whether or not this is the final word of a signal list.

(2) Short Signal

As shown in FIG. 18, a Short Signal word contains two instruction
reference fields. Each of these fields is a 7-bit relative address,
relative to the current base. An instruction reference can specify a
target instruction that is up to ±63 instruction indices away from the
current base. The current base initially is the index of the
instruction associated with the signal list being processed. A zero in
either "Instruction Reference" field of a Short Signal word marks the
end of the signal list.

(3) Change Base

A Change Base word causes redefinition of the current base. When
processing of the signal list of a dataflow instruction is begun, the
current base is the index of that dataflow instruction. When a Change
Base word in a signal list is interpreted, the current base is set to
the instruction index in the word. The effect of this action is that
the reference fetch commands of succeeding Short Signal words are taken
relative to the newly specified region of dataflow instructions. The
base region of the distribution memory contains one word for each of
the 4096 instructions held by a processing element. Three of the word
formats can be used in the base region: Normal Signal; Short Signal;
and Jump.

(4) Jump

As shown in FIG. 20, the Jump word, which is here specified to have a
13 bit address field, instructs the signal system controller to take
the next control word from the location in the distribution memory
specified by the Jump address.

(5) Conditional Jump

As shown in FIG. 21, a Conditional Jump word allows a jump to be
conditioned on the completion code generated by a dataflow instruction.
The following table specifies the tests on arithmetic results
corresponding to the 16 possible codes in the cond field of the
Conditional Jump words. If the completion code for the current dataflow
instruction matches the test, then the next signal list word
interpreted is the one at the Jump location in the distribution memory.

Code
a b c d    Meaning
0 0 0 0    negative zero
0 0 0 1    negative over-range
0 0 1 0    negative normal
0 0 1 1    negative under-range
0 1 0 0    positive zero
0 1 0 1    positive over-range
0 1 1 0    positive normal
0 1 1 1    positive under-range
1 0 0 0    propagated error
1 0 0 1    unknown value
1 0 1 0    zero-divide
1 0 1 1    unused

a = error
b = neg/pos
c = large/small
d = over, under/normal, zero

The jump rel field is an 8-bit field that specifies a distribution
memory location up to ±127 locations away from the address of the
current word.

(6) Skip

As shown in FIG. 22, a Skip word contains a 4-bit Mask field and a
4-bit Sense field. This word causes

x = Mask ∧ (Sense ⊕ cc)

to be evaluated, where cc is the completion code of the current
dataflow instruction. If x = 0000 (binary), the word immediately after
the Skip word is interpreted next. If x ≠ 0000 (binary), the next word
is skipped and the following word is interpreted next. If the end field
is 1, interpretation of the signal list is terminated following
interpretation of the word immediately following the Skip word (if it
is interpreted at all), otherwise immediately.

(7) Dispatch

As shown in FIG. 23, a Dispatch word signals an instruction in a group
of two, four, eight, or sixteen target instructions at consecutive
indices. The control field specifies that one, two, three or four bits
of the completion code are to be used to determine the dataflow
instruction to be signalled. The selected bits from the completion code
are added to a base index to obtain the index of the instruction to be
signalled. The base index is equal to the current base.
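The way the signal system controller walks a signal list can be sketched with a small interpreter covering five of the word formats described above: Normal Signal, Short Signal, Change Base, Jump and Skip. Everything about the representation is an assumption made for illustration: words are Python tuples rather than 16-bit encodings, the distribution memory is a dictionary, the first word of a list is taken from the address equal to the instruction index, and the end field of the Skip word is ignored.

```python
def interpret_signal_list(memory, instr_index, cc=0):
    """Return the instruction indices signalled for one completed instruction.

    memory maps addresses to tuples (hypothetical encoding):
      ("normal", index, end)      signal index; end=True closes the list
      ("short", ref1, ref2)       signal base+ref; a zero ref closes the list
      ("change_base", index)      set the current base
      ("jump", address)           continue at address
      ("skip", mask, sense)       skip the next word if mask & (sense ^ cc) != 0
    """
    base = instr_index      # current base starts at the instruction's own index
    addr = instr_index      # assumption: header word sits at the instruction index
    signalled = []
    while True:
        word = memory[addr]
        addr += 1
        kind = word[0]
        if kind == "normal":
            _, index, end = word
            signalled.append(index)
            if end:
                return signalled
        elif kind == "short":
            for ref in word[1:]:
                if ref == 0:            # a zero reference marks the end of the list
                    return signalled
                signalled.append(base + ref)
        elif kind == "change_base":
            base = word[1]
        elif kind == "jump":
            addr = word[1]
        elif kind == "skip":
            _, mask, sense = word
            if mask & (sense ^ cc):     # x != 0: skip the following word
                addr += 1
        else:
            raise ValueError(f"unknown word kind: {kind}")


if __name__ == "__main__":
    memory = {
        5: ("jump", 4096),
        4096: ("short", 1, -2),
        4097: ("change_base", 100),
        4098: ("skip", 0b1000, 0b0000),
        4099: ("normal", 7, True),
        4100: ("short", 3, 0),
    }
    print(interpret_signal_list(memory, 5, cc=0b0000))
    print(interpret_signal_list(memory, 5, cc=0b1000))
```

In the example, the completion code selects between signalling instruction 7 directly and signalling instruction 103 relative to the changed base, which is the kind of two-way choice the Skip word is meant to express.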

(IX) Interprocessor Communication— FIG. 26

It is desirable to support a high rate of data transmission between two
dataflow instructions, even when they reside in different processing
elements. The preferred embodiment of the present invention includes an
efficient mechanism for implementation of such transmission. This
mechanism, which is now to be described in connection with FIG. 26,
permits an instruction in processing element A to send several data
packets to an instruction in processing element B before requiring a
returned acknowledgement indication from processing element B. In this
way, it is possible to achieve a higher transmission rate than would be
possible if receipt of an acknowledge indication were necessary before
another data packet could be sent. The mechanism allows processing
element A to transmit up to some fixed number of data packets before
receiving an acknowledge packet. It involves two instructions, one in
processing element A and one in processing element B. The instruction
in processing element A is a Send instruction 160 having an S/R result
specifier; the instruction in processing element B is a Receive
instruction 162 having one S/R operand specifier.

The nomenclature used in FIG. 26 and in the present description is as
follows:

rcv-pe       Receiving processing element number
snd-pe       Sending processing element number
rcv-cb       Specifier of control block in receiving processing element
snd-cb       Specifier of control block in sending processing element
snd-instr    Index of send instruction
rcv-instr    Index of receive instruction
count        Number of data packets sent but not acknowledged
in-ptr       Input pointer for FIFO buffer in receiving processing element
out-ptr      Output pointer for FIFO buffer in receiving processing element




Associated with the Send instruction 160 is a control block 164 which
contains a count item c. Receive instruction 162 has an associated
control block 166 which contains a first index x (an In-pointer) and a
last index y (an Out-pointer). Indices x and y refer to a buffer area
in the data memory of processing element B. The mechanism controls the
transmission of data packets from processing element A to processing
element B and the transmission of acknowledge packets from processing
element B to processing element A. As shown in FIG. 26, the data packet
fields include: a processing element number, a control block address,
and a data value. The fields of an acknowledge packet include a
processing element number and a control block address. There are four
transactions that implement the protocol supported by this structure:
the firing of the Send instruction; the firing of the Receive
instruction; the action of signalling the Send instruction when
processing element A receives an acknowledge packet; and the action of
signalling the Receive instruction when processing element B receives a
data packet. These transactions are described as follows:

(a) Firing the Send instruction: (1) send the data packet; (2)
increment the count c by 1 modulo the buffer length; (3) if c≠0, signal
the Send instruction (itself).

(b) Receipt of a data token (processing element B): (1) store the data
value in the data memory at location x specified by the control block;
(2) if x=y, send a signal to the Receive instruction of processing
element B; (3) increment x by 1 modulo the buffer length.

(c) Firing the Receive instruction: (1) read the data item from the
data memory at y as specified by the control block; (2) transmit an
acknowledge packet containing the processing element number of the
sending processing element and its control block address; (3) increment
y by 1 modulo the buffer length; (4) if x≠y, signal the Receive
instruction (itself).

(d) Receipt of an acknowledge token (processing element A): (1) if c=0,
send a signal to the Send instruction; (2) decrement c by 1 modulo the
buffer length.
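The four transactions (a) through (d) can be exercised with a small simulation. This is a sketch under stated assumptions, not the disclosed hardware: the class `SendReceiveLink` and function `simulate` are hypothetical names, signals are modeled as simple counters, the Send instruction is assumed to hold one signal initially, its data producer is assumed to be always ready, and packet delivery order between the four transactions is chosen at random.

```python
import random
from collections import deque


class SendReceiveLink:
    """Toy model of the Send/Receive control blocks of FIG. 26."""

    def __init__(self, buffer_len):
        self.n = buffer_len
        self.c = 0                          # count item in the sender's control block
        self.x = 0                          # in-pointer in the receiver's control block
        self.y = 0                          # out-pointer in the receiver's control block
        self.buffer = [None] * buffer_len   # buffer area in data memory of element B
        self.data_in_flight = deque()       # data packets inside the routing network
        self.acks_in_flight = 0             # acknowledge packets inside the network
        self.send_signals = 1               # assumption: Send starts out enabled
        self.recv_signals = 0
        self.received = []

    def fire_send(self, value):             # transaction (a)
        self.send_signals -= 1
        self.data_in_flight.append(value)
        self.c = (self.c + 1) % self.n
        if self.c != 0:
            self.send_signals += 1          # Send signals itself

    def receive_data_packet(self):          # transaction (b)
        self.buffer[self.x] = self.data_in_flight.popleft()
        if self.x == self.y:
            self.recv_signals += 1
        self.x = (self.x + 1) % self.n

    def fire_receive(self):                 # transaction (c)
        self.recv_signals -= 1
        self.received.append(self.buffer[self.y])
        self.acks_in_flight += 1
        self.y = (self.y + 1) % self.n
        if self.x != self.y:
            self.recv_signals += 1          # Receive signals itself

    def receive_ack_packet(self):           # transaction (d)
        self.acks_in_flight -= 1
        if self.c == 0:
            self.send_signals += 1
        self.c = (self.c - 1) % self.n


def simulate(n_items, buffer_len, seed=0):
    """Run the protocol; return (received values, peak unacknowledged count)."""
    rng = random.Random(seed)
    link = SendReceiveLink(buffer_len)
    to_send = deque(range(n_items))
    sent = acked = peak = 0
    while True:
        ready = []
        if to_send and link.send_signals > 0:
            ready.append("send")
        if link.data_in_flight:
            ready.append("data")
        if link.recv_signals > 0:
            ready.append("receive")
        if link.acks_in_flight:
            ready.append("ack")
        if not ready:
            return link.received, peak
        step = rng.choice(ready)
        if step == "send":
            link.fire_send(to_send.popleft())
            sent += 1
        elif step == "data":
            link.receive_data_packet()
        elif step == "receive":
            link.fire_receive()
        else:
            link.receive_ack_packet()
            acked += 1
        peak = max(peak, sent - acked)


if __name__ == "__main__":
    values, peak = simulate(n_items=10, buffer_len=4, seed=1)
    print(values, "peak unacknowledged:", peak)
```

In this model the sender never has more unacknowledged packets outstanding than the buffer length, so the receiver's buffer cannot be overwritten, and values arrive in the order sent regardless of how the four transactions interleave.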

(X) Pipelining in the Illustrated Embodiment— FIG. 24 

Of great importance in the architecture of the present 
invention is pipelining— the processing of data by suc- 
cessive stages of a computing mechanism so that each 
stage is usefully busy in every cycle of operation. Pipe- 
lining is used in the high performance arithmetic units 
of conventional supercomputers where the stages are 
subcomponents of an arithmetic operation such as float- 
ing point addition or multiplication. When a computa- 
tion is pipelined on a static dataflow computer, the 
stages of the pipeline are successive groups of nodes in 
a dataflow graph as illustrated in FIG. 24 — that is, each 
stage comprises a set of perhaps many arithmetic opera- 
tions. Note that, whereas the interconnection of stages 
is "hardwired" in the typical arithmetic unit, in a data- 
flow program these connections are the arcs between 
dataflow nodes that are part of the stored program. 

In consequence of the rule that any arc of a dataflow 
graph can hold at most one token, pipelined operation is 
the natural mode of operation of dataflow graphs. Yet, 
to achieve the highest computation rate, every path 
through a dataflow graph must contain exactly the same 
number of nodes. A dataflow graph arranged to support 
this sort of pipelined behavior is said to be maximally 
pipelined. An acyclic dataflow graph can always be 
transformed into a maximally pipelined graph by the 
addition of identity actors. In a balanced graph, the 
nodes divide into stages such that every arc is incident 
on one node in each stage. Then all nodes in even stages 
may fire alternately with all nodes in odd stages, and 
each node fires as often as possible. 
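The balancing transformation mentioned above, adding identity actors until every path has the same length, can be sketched for an acyclic graph as follows. The function name `identity_actors_needed` and the dictionary representation are hypothetical, and the graph is assumed to have a single output node so that equal path lengths to that node imply equal path lengths through the whole graph.

```python
def identity_actors_needed(preds):
    """Return {(producer, consumer): number of identity actors to insert}.

    preds maps each node to the list of its predecessor nodes; inputs map
    to an empty list. Each node is placed at a stage equal to its longest
    distance from an input, and every arc that spans more than one stage
    is padded with identity actors.
    """
    stage = {}

    def depth(node):
        if node not in stage:
            stage[node] = 1 + max((depth(p) for p in preds[node]), default=-1)
        return stage[node]

    padding = {}
    for node, sources in preds.items():
        for p in sources:
            gap = depth(node) - depth(p) - 1
            if gap > 0:
                padding[(p, node)] = gap
    return padding


if __name__ == "__main__":
    # z := (x*y + 3.5) * (x*y - 5.2), with the constants held in the instructions
    balanced = {"x": [], "y": [], "mul1": ["x", "y"],
                "add": ["mul1"], "sub": ["mul1"], "mul2": ["add", "sub"]}
    print(identity_actors_needed(balanced))
    # w := x*y + x, where x reaches the adder one stage too early
    skewed = {"x": [], "y": [], "mul": ["x", "y"], "add": ["mul", "x"]}
    print(identity_actors_needed(skewed))
```

The first graph is already balanced and needs no identity actors; the second needs one on the arc from x to the adder.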

The instruction execution period places an upper 
bound on the computation rate of pipelined code. The 
rate cannot exceed the limit imposed by the tightest 
cycle in the machine code graph. FIG. 24 illustrates 
dataflow code for performing 

z:=(x*y+3.5)*(x*y-5.2) 
Every cycle is incident on two instructions, so that, given a
computation rate of 50 kilohertz, each instruction may fire at most
once every 20 microseconds. At this rate, a processing element holding
400 instructions in pipelined code is running at 20 mips. With half of
these instructions as floating point operations, ten megaflops
performance is achieved. Thus, as a practical matter, the computation
rate is high.
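The arithmetic behind these figures can be checked directly. The function name `pipelined_rates` is hypothetical; the inputs are the values quoted in the text.

```python
def pipelined_rates(instructions, firing_rate_hz, flop_fraction):
    """Return (firing period in microseconds, MIPS, MFLOPS) for one element."""
    period_us = 1e6 / firing_rate_hz                 # time between firings
    mips = instructions * firing_rate_hz / 1e6       # all instructions firing
    mflops = mips * flop_fraction                    # floating point share
    return period_us, mips, mflops


if __name__ == "__main__":
    print(pipelined_rates(400, 50_000, 0.5))
```

With 400 instructions each firing 50,000 times per second, the element executes 20 million instructions per second, half of which are floating point operations.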

(XI) Programming Example A 

FIG. 27 shows dataflow machine code for the quadratic formula. A signal
at Start means that locations a, b and c have been properly set. An
acknowledge signal at Start means that the producer may set a, b and c
again. The signalling at Done is similar. In the figure, a bracket
indicates that signals are to be treated to implement the logic OR
operation; a circled dot indicates treatment to implement the logic AND
operation. This code can be understood by comparing it with the
dataflow graph in FIG. 4, although there is a difference in that the
gate and merge nodes in FIG. 27 do not appear as distinct instructions.
Rather, the test of D is implemented by instruction c2-1, which
commands the enable system to use one of two alternate signal lists
according as D≥0.0 or D<0.0. These lists serve to activate the
instructions for one arm or the other of the conditional construction.
Generally, the merge is implemented simply by having the instructions
of each conditional arm (c3-4 through c3-7 for the TRUE arm and c3-10
through c3-13 for the FALSE arm) write into the same location of data
memory. Because skew in timing might lead one arm to write its result
too early or too late, yielding nondeterminate behavior, a second test
of D, instruction c3-1, is included to ensure that the code is faithful
to the dataflow graph. The outcome of the test is used to signal the
cells that are next in line to write into the result locations.

In FIG. 27, acknowledge signals ack-1, ack-2, ack-3 indicate completion
of the unit of work in each of the stages identified in the diagram.
Signal ack-2, for example, tells Stage 3 to begin processing the work
unit just completed by Stage 2, and also tells Stage 1 that it is
permitted to begin processing its next work unit. This code is not
maximally pipelined because the paths through each stage contain one,
two, or three instructions instead of only one. Explanation of the
maximally pipelined version of this code, which contains a large number
of signal arcs that are somewhat difficult to comprehend, is not needed
for a full understanding of the present invention.

Instructions c2-2, c2-3 and c3-1 are "overhead" instructions, the
primary function of which is to copy values generated by other
instructions. These are included in the machine code to achieve a
balanced pipeline configuration. The need for these overhead
instructions is avoided by using the result queueing scheme of FIG. 28.
In FIG. 28(a), instruction c2 serves as a buffer for the result of
instruction c1 in order to permit execution of instruction c1 again
before instruction c2 causes execution of instruction c3. The same
advantage may be obtained by providing instruction c1 with a result
queue as in FIG. 28(b). Instructions c1' and c3' use index counters (A)
and (B) to address elements of the queue. The notation used there means
that the index counter (A) is incremented by ONE before it is used to
store or access an item in the queue. Note that, with just the
signalling indicated in FIG. 28(b), counter (B) never can advance
beyond counter (A), so no element of the queue will be read before it
is written. Although counter (A) may advance arbitrarily far ahead of
counter (B) in principle, the code as a practical matter includes other
constraints that limit this interval to a small integer. The result
queue, in one form, is implemented as a ring buffer in the data memory
of a processing element.

The use of this result queue structure eliminates three instructions
from the machine code of FIG. 27. In this case, the code consists of
twenty-one instructions of which fourteen are used by each work unit as
it passes down the pipeline, and the average number of arithmetic
operations performed per work unit is ten.



(XII) Programming Example B 

Large scale scientific computation generally involves 
large arrays of data. The efficient processing of large 
volumes of data in this form is supported by the data- 
flow computer of the present invention. In dataflow 
computation, it is not satisfactory to view arrays, as in 
Fortran, as a set of values occupying successive loca- 
tions in a memory. Two alternative views useful in 
dataflow computation are illustrated in FIG. 29. In one 
view, an array is regarded as a set of scalar values car- 
ried simultaneously by several dataflow arcs, the array 
being distributed in space. In the other view, an array is 
regarded as a set of successive values carried by tokens 
traversing a single dataflow arc, the array being distrib- 
uted in time. The choice between distributing in space 
or in time is an important consideration in the design of 
a compiler for matching parallelism in an algorithm to 
the parallelism supportable by the target computer. In 
many large scale codes, the parallelism available in the 
algorithm is so great that distributing most arrays in 
time is necessary to achieve balanced use of machine 
resources. 

The following example shows how dataflow machine 
code in the form of pipelined code effectively imple-
ments a process that consumes data from one or more
large multidimensional arrays and generates a large 
multidimensional array as its result. The example is 
Smooth, a module from an aerodynamic simulation 
code. It accounts for about five percent of all arithmetic 
operations performed in a run of the complete code. 
This example uses a cubical grid divided into 128 inter- 
vals along each of three coordinate directions. Five 
physical quantities are associated with each of the 
128 × 128 × 128 grid points. The purpose of Smooth is to
produce new data for the grid by applying a smoothing 
formula to sets of adjacent points. To compute the new 
data for point (l, k, j), the old values at distances ±2, ±1,
and 0 parallel to one of the coordinate axes are used in 
a weighted sum. Simpler formulas are used at the 
boundaries. The smoothing process is carried out as a 
separate computation for each direction, each computa- 
tion using the result of the previous computation, as 
illustrated in FIG. 30.

The Val program of FIG. 31 defines the smoothing 
process for the coordinate associated with index j, and 
for just one of the five physical quantities. This program 
makes good use of the Val forall construct, which de- 
fines a new array value in terms of existing value defini- 
tions. For example, the element-by-element sum of two 
vectors may be written as 

forall i in [1, n] construct A[i] + B[i] endall
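
For readers unfamiliar with Val, the forall expression above corresponds roughly to the following Python sketch. The function name `vector_sum` is hypothetical, and the 1-based index range of the Val expression becomes a 0-based Python range.

```python
def vector_sum(a, b):
    """Element-by-element sum, analogous to the Val forall expression above."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Each result element depends only on a[i] and b[i], so the elements
    # can be computed independently of one another.
    return [a[i] + b[i] for i in range(len(a))]


result = vector_sum([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```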

When considering how the Smooth module may be 
implemented as machine code for a static dataflow
supercomputer, an immediate observation is that the
amount of data involved is very large, much too large to 
be held in the data memories of the processing elements. 
Moreover the program modules that perform the 
smoothing operation in the two other directions need 
access to the data in a different order, at least if the 
formulas are to be effectively pipelined. For this reason, 
each module must generate and store its results com- 
pletely before the next module may begin. A local array 
memory, a large and therefore slower memory than the 
data memory, is associated with each processing ele- 
ment to hold large data objects such as the arrays of 
intermediate data needed in this example.

A dataflow machine program for the Smooth module
is shown in FIGS. 32, 33. In successive executions, the
instruction Generate(a,b) produces internally the se-
quence of integers starting with "a" and ending with
"b"; following this sequence of executions, it produces
the symbol * on its next execution; this cycle is repeated
endlessly thereafter. The signal list of a Generate in-
struction may condition transmission of signals on prop-
erties of the integer values it produces, or on the pres-
ence or absence of the symbol *. For example, a signal
arc labelled [5 . . . 128] sends a signal if the integer
generated is in the indicated range. The set notation
{1,2} is used to specify that the target instruction is to be
signalled if the integer belongs to the set. Some instruc-
tions call for accessing operand values from array mem-
ory, or storing result values in array memory.
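
By way of illustration only, the behavior of the Generate instruction and of its conditional signal arcs may be sketched in Python as follows. The names `generate`, `in_range`, `in_set`, and `STAR` are assumptions introduced for this sketch; the machine instruction itself produces one value per execution rather than running as a Python generator.

```python
from itertools import islice

STAR = "*"  # stands in for the end-of-sequence symbol


def generate(a, b):
    """Yield a, a+1, ..., b, then STAR, and repeat the cycle endlessly."""
    while True:
        for value in range(a, b + 1):
            yield value
        yield STAR


def in_range(value, low, high):
    """Condition for a signal arc labelled with a range such as [5 . . . 128]."""
    return value != STAR and low <= value <= high


def in_set(value, members):
    """Condition for a signal arc labelled with a set such as {1,2}."""
    return value != STAR and value in members


outputs = list(islice(generate(1, 3), 9))
```

Here `outputs` holds two full cycles followed by the start of a third, and the two predicates decide whether a given output would cause a signal to be sent.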

An overview of the operation of this code is as fol-
lows: Three of the Generate instructions on the left set
indices l1, k1, and j1 to successive points of the grid,
with j1 being the fast index. Array memory fetch in-
structions c2-1 and c2-2 move elements of input data
arrays Q and D into result queues in data memory. Note
that the Generate instruction c2-3 waits for five values to
enter each of the result queues before signalling the
product instructions c2-4, . . . , c2-8. In the product
instructions, the offset notation on index counter Ⓐ
means that the element four items earlier than the one
at Ⓐ in the result queue is to be read. This accomplishes
the desired offset addressing so that all five product
terms may be computed at once. The three boxes B1, B2
and B3 contain the instructions for evaluating the two
body expressions F and G (two copies of F and one copy
of G). Box B1 handles only element 2 of each row; box
B2 handles elements 3 through 126, and box B3 handles
element 127. FIG. 34 shows how the signalling arrange-
ment ensures that each of these boxes receives the
correct product values.
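
By way of illustration only, the offset addressing described above may be sketched in Python as follows. The function name `product_terms` and the coefficient values in `WEIGHTS` are hypothetical; the actual smoothing coefficients are given by the Val program of FIG. 31 and are not reproduced here.

```python
def product_terms(queue, head, weights):
    """Return the five products for the window ending at count `head`.

    `queue` is a ring buffer of at least five slots and `head` is the count of
    the most recently stored item; offsets 0..4 select that item and the four
    items stored before it.
    """
    size = len(queue)
    return [weights[k] * queue[(head - k) % size] for k in range(5)]


WEIGHTS = [1.0, -4.0, 6.0, -4.0, 1.0]  # hypothetical coefficients, not from the patent

ring = [0.0] * 5
windows = []
for count, q in enumerate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], start=1):
    ring[count % 5] = q  # store the newly fetched element
    if count >= 5:       # wait until five values have entered the queue
        windows.append(product_terms(ring, count, WEIGHTS))
```

Each entry of `windows` holds the five product terms for one grid point, all read from a queue of only five locations.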

Instructions c3-2, . . . , c3-4 and the Generate instruc-
tion c5-3 implement the determinate merging of results
of the conditional arms. Instructions c4-1 and c4-2 per-
form the final computation steps and store the results in
the array memory. Instructions c4-3 and c4-4 store
values of the end elements of each row. To remain in
step, each of the three index counters Ⓐ, Ⓑ, and
Ⓒ must be stepped the same number of times in the
processing of each row; so extra steps must be inserted
where data items held in a result queue are not refer-
enced. Thus instruction c4-3 is included to step counter
Ⓒ for elements 1 and 128 of the D array produced by
instruction c2-2. The three Generate instructions on the
right are needed to ensure that the Done signal is not
sent before all result values have been stored safely
away in the array memories. The remaining instructions
on the left, c6-1 through c6-5, store old data values in
the boundary elements of the new array, as called for by
the Val program.

Determining the amount of data memory required for
the two result queues involves the following consider-
ations. For values of q produced by instruction c2-1,
five locations are necessary and sufficient because, once
the first five values have been produced, the signalling
allows further values to be produced only as queue
elements are released by firings of the product cells. In
determining the size of the result queue for instruction
c2-2, it is necessary to determine how many values of d
may be produced in advance of their use by instruction
c4-1. The Petri net of FIG. 35 is helpful in making this
evaluation. Each transition models the execution of the
indicated instruction in FIGS. 32, 33, and the placement
of tokens corresponds to the initial count and reset
values in the machine code. (The number of transitions
used to represent box B2 reflects the depth of the ex-
pression for G, which is four.) Examination of the Petri
net shows that transition c2-2 can fire at most seven
times before the first firing of c4-1 and, therefore, that a
queue of seven elements is sufficient.

To complete this discussion of the Smooth module, 
consider its performance parameters in reference to 
possible full performance of a dataflow computer with 
256 processing elements. For the smoothing of all five 
physical quantities of the problem, the counts of arith- 
metic operations and array memory read/write opera- 
tions are as follows (the Jacobian array D need be read 
only once): 



    adds    multiplies    divides    reads    writes
    25n³    40n³          5n³        11n³     5n³



Since three data memory cycles are required for each 
arithmetic operation, the ratio of data memory accesses 
to array memory accesses is (3 × 70)/16 = 13.1. The pro-
cessing element architecture of FIGS. 1 and 5 incorpo-
rates structure that corresponds to this design consider- 
ation. 
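
By way of illustration only, the ratio stated above may be checked with the following Python sketch. The constant names are hypothetical. The counts are coefficients of n³ as listed in the text; the reads count of 11 is an inference from the stated total of 16 array memory accesses.

```python
# Coefficients of n**3 for the smoothing of all five physical quantities.
ARITHMETIC = {"adds": 25, "multiplies": 40, "divides": 5}
ARRAY_MEMORY = {"reads": 11, "writes": 5}  # reads inferred from the total of 16
DATA_CYCLES_PER_OP = 3  # data memory cycles per arithmetic operation

arithmetic_ops = sum(ARITHMETIC.values())
array_accesses = sum(ARRAY_MEMORY.values())
ratio = DATA_CYCLES_PER_OP * arithmetic_ops / array_accesses

print(f"{arithmetic_ops} arithmetic operations, {array_accesses} array accesses")
print(f"data memory accesses per array memory access: {ratio:.1f}")
```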

The rate at which instructions in the machine code 
can fire is controlled by the most restrictive cycle in the 
graph. There are three cycles of size four instructions in 
the dataflow graph, as shown in FIG. 36. These cycles 
involve Generate instructions, result queue manipula- 
tion, and array memory accesses, as well as arithmetic 
operations. Ordinarily, the most limiting cycle is Cycle 
3 since it contains a division and also a read/write of the 
array memory. The time span for execution of an in- 
struction has two parts, that due to the enable system 
and that due to the execution system. The former com- 
ponent ordinarily depends on how many enabled in- 
structions are available to a processing element, and the 
latter component depends on the complexity of the
instruction executed. The following calculation is based 
on a cycle time of 30 microseconds for Cycle 3 and the 
assumption that this cycle determines the rate of com- 
putation. The total number of arithmetic operations 
performed for each work unit entering the pipeline is 
70. In an embodiment having a repetition cycle of 30 
microseconds duration, these instructions can support 
70/(30 × 10⁻⁶) ≈ 2.33 million operations per second, not
enough to keep one processing element busy, let alone
256. The solution to this mismatch is to use many copies
of the code. The use of four copies in each processing
element supports 4 × 2.33 = 9.32 megaflops of perfor-
mance for each processing element, which is close to
the peak performance of the machine. 
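
By way of illustration only, the performance estimate above may be reproduced with the following Python sketch. The constant names are hypothetical. Carrying full precision gives about 9.33 megaflops per processing element; the figure of 9.32 in the text results from rounding the per-copy rate to 2.33 before multiplying.

```python
OPS_PER_WORK_UNIT = 70   # arithmetic operations per work unit entering the pipeline
CYCLE_SECONDS = 30e-6    # assumed repetition cycle of Cycle 3
COPIES_PER_ELEMENT = 4   # copies of the code in each processing element

ops_per_second = OPS_PER_WORK_UNIT / CYCLE_SECONDS
per_element_mflops = COPIES_PER_ELEMENT * ops_per_second / 1e6

print(f"one copy:    {ops_per_second / 1e6:.2f} million operations per second")
print(f"four copies: {per_element_mflops:.2f} megaflops per processing element")
```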

What is claimed is: 

1. Data processing means for effecting data process- 
ing transactions according to programs of instructions 
including predecessor-successor pairs of instructions, a 
predecessor instruction having one or more successor 
instructions, a successor instruction having one or more 
predecessor instructions, said instruction having indi- 
ces, said data processing means comprising: 
(a) first means for effecting the transaction of instruc-
tion execution components specifying operands
and operators to produce execution results pertain-
ing to further operands and operators, and execu-
tion completion signals pertaining to said execution
results;

(b) second means for effecting the transaction of in-
struction enable components specifying instruction
execution sequencing to produce enable events
pertaining to further instruction execution sequenc-
ing, and enable completion signals pertaining to
said enable events;

(c) associations of said execution components and
said enable components, and associations of said
execution completion signals and said enable com-
pletion signals having associations of indices corre-
sponding to said instruction indices;

(d) third means for the transmission of said enable
completion signals from said second means to said
first means, and for the transmission of said execu-
tion completion signals from said first means to said
second means;

(e) a transaction of a given execution component and
a transaction of a given enable component occur-
ring respectively in said first means and said second
means, said given execution component and said
given enable component having associated indices
corresponding to a given instruction index;

(f) a transaction of a given instruction being condi-
tioned on the occurrence of completion signals
generated by the transaction of certain predecessor
and successor instructions of said given instruction;

(g) transactions of said execution components subject
to said enable completion signals, and transactions
of said enable components subject to said execution
completion signals being overlapped to occur
freely and concurrently with respect to each other.

2. The data processing means of claim 1 wherein an
enable signal consists of a reference to an instruction
index and a done signal consists of a reference to an
instruction index and at least one member of the class
consisting of condition codes and the null code, said
condition codes being members of the class consisting
of a few Boolean properties and logical combinations
thereof, said Boolean properties being properties of the
execution transaction which generated said last men-
tioned done signal.

3. The data processing means of claim 1 wherein said
instruction execution components are characterized by
a first operand address specifier, a second operand ad-
dress specifier and a result address specifier.

4. The data processing means of claim 1, said first
means being an execution system including execution
memory means, arithmetic/logic means, and execution
control means, said second means being an enable sys-
tem including enable memory means and enable control
means, said enable memory means including indicator
memory means and distribution memory means, said
execution control means and said enable control means
being operatively connected for communication there-
between of enable completion signals and execution
completion signals, said enable completion signals and
said execution completion signals including instruction
index specifiers.

5. The data processing means of claim 1 wherein said
instruction execution components include opcodes and
at least a member of the class consisting of data con-
stants, input data address specifiers and output data
address specifiers.

6. The data processing means of claim 1 wherein the
data includes input data having operand values and
output data characterized by members of the class con-
sisting of result values and result conditions.

7. The data processing means of claim 4 wherein said
execution memory means stores instruction execution
components at locations corresponding to instruction
index specifiers and operands at locations correspond-
ing to operand address specifiers.

8. The data processing means of claim 4 wherein an
instruction execution transaction includes retrieval of
an instruction execution component and corresponding
input data from locations in said execution memory
means, execution of an instruction execution component
by said arithmetic/logic means to produce output data,
and storage of output data in locations in said execution
memory means.

9. The data processing means of claim 4 wherein said
execution control means is responsive to an enable com-
pletion signal to cause an instruction execution transac-
tion, and, after completion thereof, to cause generation
of an execution completion signal.

10. The data processing means of claim 4 wherein
said indicator memory means includes count registers
and reset registers.

11. The data processing means of claim 10 wherein
said count registers and said reset registers, respectively
for a given instruction, provide a current enable count
that records completions of executions of related prede-
cessor and successor instructions, and a reset enable
count that restores a given register to reset condition
when a given instruction has been selected for execu-
tion.

12. The data processing means of claim 4 wherein the
instruction enable component of an instruction includes
instruction index specifiers of one or more of its succes-
sor instructions.

13. The data processing means of claim 12 wherein
said distribution memory means stores lists of said in-
struction index specifiers, a given one of said lists per-
taining to a given one of said successor instructions.

14. The data processing means of claim 4 wherein
said enable control means, on occurrence of one or
more enable conditions for one or more enabled instruc-
tions, causes selection of an enabled instruction and
generation of an enable completion signal referring to
the instruction index specifier of said enabled instruc-
tion.

15. The data processing means of claim 4 wherein
said enable control means, on receipt of an instruction
completion signal, causes reference thereto in said indi-
cator memory in correspondence with certain of the
instruction index specifiers of successor instructions of
said enabled instruction.

16. The data processing means of claim 1 wherein a
processing transaction includes generation of an enable
completion signal following occurrence of an enable
condition, an arithmetic/logic transaction to generate
an execution completion signal, and indication thereof
in said indicator memory means.

17. A computer for effecting data processing transac-
tions according to programs of instructions including
predecessor-successor pairs of instructions, a predeces-
sor instruction having one or more successor instruc-
tions, a successor instruction having one or more prede-
cessor instructions, said instructions having indices, said
computer comprising (A) a plurality of processing ele-
ment means, (B) a plurality of array memory means, (C)
a plurality of data path means, and (D) routing network
means; each of said processing element means compris-
ing:

(a) first means for effecting the transaction of instruc-
tion execution components specifying data and
data operators to produce execution results per-
taining to further data and data operators, and
execution completion signals pertaining to said
execution results;

(b) second means for effecting the transaction of in-
struction enable components specifying instruction
execution sequencing to produce enable events
pertaining to further instruction execution sequenc-
ing, and enable completion signals pertaining to
said enable events;

(c) associations of said execution components and
said enable components, and associations of said
execution completion signals and said enable com-
pletion signals having associations of indices corre-
sponding to said instruction indices;

(d) third means for the transmission of said enable
completion signals from said second means to said
first means, and for the transmission of said execu-
tion completion signals from said first means to said
second means;

(e) a transaction of a given execution component and
a transaction of a given enable component occur-
ring respectively in said first means and said second
means, and given execution component and said
given enable component having associated indices
corresponding to a given instruction index;

(f) a transaction of a given instruction being condi-
tioned on the occurrence of completion signals
generated by the transaction of certain predecessor
and successor instructions of said given instruction;
given ones of said processing element means and given
ones of said array memory means transmitting data and
data operators therebetween only via given ones of said
data path means, any one of said processing element
means and any other of said processing element trans-
mitting data and data operators therebetween via said
routing network.

18. The computer of claim 17 wherein said first means
is an execution system including execution memory
means, arithmetic/logic means, and execution control
means, said second means is an enable system including
enable memory means and enable control means, said
enable memory means including indicator memory
means and distribution memory means, said execution
control means and said enable control means being
operatively connected for communication therebe-
tween of enable completion signals and execution com-
pletion signals, said enable completion signals and said
execution completion signals including instruction
index specifiers.

19. The computer of claim 18 wherein said instruc-
tion execution components include opcodes and at least
a member of the class consisting of data constants, input
data address specifiers and output data address specifi-
ers.

20. The computer of claim 19 wherein an enable
signal consists of a reference to an instruction index and
a done signal consists of a reference to an instruction
index and at least one member of the class consisting of
condition codes and the null code, said condition codes
being members of the class consisting of a few Boolean
properties and logical combinations thereof, said Bool-
ean properties being properties of the execution transac-
tion which generated said last mentioned done signal.

21. The computer of claim 20 wherein said execution
memory means stores instruction execution components 
at locations corresponding to instruction index specifi- 
ers and operands at locations corresponding to operand 
address specifiers. 

22. The computer of claim 21 wherein an instruction 
execution transaction includes retrieval of an instruc- 
tion execution component and corresponding input data 
from locations in said execution memory means, execu- 
tion of an instruction execution component by said 
arithmetic/logic means to produce output data, and 
storage of output data in locations in said execution 
memory means. 

23. A dataflow processing element for effecting pro- 
cessing element transactions on data according to pro- 
grams of instructions, selected predecessor-successor 
pairs of instructions being related by functional depen- 
dencies, a predecessor instruction having one or more 
successor instructions, a successor instruction having 
one or more predecessor instructions, said processing 
element comprising: 

(a) execution system means including execution mem- 
ory means, arithmetic/logic means, and execution 
control means; and 

(b) enable system means including enable memory 
means and enable control means, said enable mem- 
ory means including indicator memory means and 
distribution memory means; 

(c) said execution control means and said enable con- 
trol means being operatively connected for com- 
munication therebetween of fire commands and 
done commands; said fire commands being trans- 
mitted from said enable control means to said exe- 
cution control means, and said done commands 
being transmitted from said execution control 
means to said enable control means, said instruc- 
tions including instruction execution components 
and instruction enable components, instructions 
being identified by instruction indices; said fire 
commands and done commands including instruc- 
tion index specifiers; 

(d) said instruction execution components including 
opcodes and data address specifiers including input 
data address specifiers and output data address 
specifiers; input data including operand values and 
output data characterized by result values and re- 
sult conditions; said execution memory means stor- 
ing instruction execution components at locations 
corresponding to instruction index specifiers and 
data, including input data and output data, at loca- 
tions corresponding to input data address specifiers 
and output data address specifiers; 

(e) an instruction execution transaction including 
retrieval of an instruction execution component 
and corresponding input data from locations in said 
execution memory means, execution of an instruc- 
tion execution component by said arithmetic/logic 
means to produce output data, and storage of out- 
put data in locations in said execution memory 
means; 

(f) said execution control means being responsive to a
fire command to cause an instruction execution
transaction, and, after completion thereof, indica- 
tion of a result condition, to cause generation of a 
done command; 

(g) said indicator memory means including indicators
corresponding to instruction indices for cumula-
tively recording occurrences of done commands
corresponding to instruction indices, an enable
condition being established by said indicators for
an instruction when said indicator memory means
has indicated the occurrences of done commands
for predecessor instructions thereof as determined
by said result conditions of said predecessor in- 
structions as transmitted by said done commands; 

(h) the instruction enable component of an instruction 
including instruction index specifiers of one or 
more of its successor instructions, said distribution 
memory means storing said instruction enable com- 
ponents; 

(i) said enable control means, on occurrence of at 
least one enable condition for at least one enabled 
instruction, causing selection of an enabled instruc- 
tion and generation of a fire command containing 
the instruction index specifier thereof; said enable 
control means, on receipt of a done command, 
causing indication thereof in indicators selected by 
said distribution memory means in correspondence 
with certain of the instruction index specifiers of 
successor instructions of said enabled instruction; 

(j) a processing element transaction including genera- 
tion of a fire command following occurrence of an 
enable condition, an arithmetic/logic transaction to 
generate a done command, and indication thereof 
in said indicator memory means, said processing 
element transactions occurring at times that are 
arbitrary with respect to a given order of said in- 
struction indices. 
24. A computer for effecting data processing transac- 
tions according to programs of instructions including 
selected predecessor-successor pairs of instruction, a 
predecessor instruction having one or more successor 
instructions, a successor instruction having one or more 
predecessor instructions, said instructions having indi- 
ces, said computer comprising (A) a plurality of pro- 
cessing element means, (B) a plurality of array memory 
means, and (C) routing network means; 
(I) each of said processing element means comprising

(a) execution system means including execution 
memory means, arithmetic/logic means, and 
execution control means; 

(b) enable system means including enable memory 
means and enable control means, said enable
memory means including indicator memory 
means and distribution memory means; 

(c) said execution control means and said enable 
control means being operatively connected for 
communication therebetween of fire commands 
and done commands; said fire commands being 
transmitted from said enable control means to 
said execution control means, and said done 
commands being transmitted from said execution 
control means to said enable control means; said 
instructions including instruction execution com- 
ponents and instruction enable components, in- 
structions being identified by instruction indices; 
said fire commands and done commands includ-
ing instruction index specifiers;

(d) said instruction execution components includ- 
ing opcodes and data address specifiers including 
input data address specifiers and output data 
address specifiers; input data including operand 
values and output data including at least a mem- 
ber of the class consisting of result values and 
result conditions; said execution memory means 
storing instruction execution components at lo-
cations corresponding to instruction index speci-
fiers and data including input data and output
data at locations corresponding to input data 
address specifiers and output data address speci- 
fiers; 

(e) an instruction execution transaction including 
retrieval of an instruction execution component 
and corresponding input data from locations in 
said execution memory means, execution of an 
instruction execution component by said arith- 
metic/logic means to produce output data, and 
storage of output data in locations in said execu- 
tion memory means; 

(f) said execution control means being responsive
to a fire command to cause an instruction execu- 
tion transaction, and, after completion thereof, 
indication of a result condition, to cause genera- 
tion of a done command; 

(g) said indicator memory means including indica- 
tors corresponding to instruction indices for 
cumulatively indicating occurrences of done 
commands corresponding to instruction indices, 
an enable condition being established by said 
indicators for an instruction when said indicator 
has indicated the occurrences of done commands 
for predecessor instructions thereof as deter- 
mined by said result conditions of said predeces- 
sor instructions as transmitted by said done com- 
mands; 

(h) the instruction enable component of an instruc- 
tion including instruction index specifiers of one 
or more of its successor instructions, said distri- 
bution memory means storing said instruction 
enable components; 

(i) said enable control means, on establishment of 
one or more enable conditions for one or more 
enabled instructions, causing selection of an en- 
abled instruction and generation of a fire com- 
mand containing the instruction index specifier 
thereof; said enable control means, on receipt of 
a done command, causing indication thereof in 
indicators selected by said distribution memory 
means in correspondence with certain of the 
instruction index specifiers of successor instruc- 
tions of said enabled instruction; 

(j) a processing element transaction including gen- 
eration of a fire command following occurrence 
of an enable condition, an arithmetic/logic trans- 
action to generate a done command, and indica- 
tion thereof in said indicator memory means, said 
processing element transactions occurring at 
times that are arbitrary with respect to a given 
order of said instruction indices; 
(II) given ones of said processing element means and 
given ones of said array memory means transmit- 
ting packets of data to each other only via said 
routing network means, any one of said processing 
element means and any other of said processing 
element means transmitting instruction execution 
components therebetween via said routing net- 
work. 

25. A data processing method for effecting data pro-
cessing transactions according to programs of instruc-
tions including predecessor-successor pairs of instruc-
tions, predecessor instruction having one or more suc-
cessor instructions, a successor instruction having one
or more predecessor instructions, said instruction hav-
ing indices, said data processing method comprising the
steps of: 

(a) transacting in a first means instruction execution 
components specifying operands and operators to 
produce execution results pertaining to further 
operands and operators, and execution completion 
signals pertaining to said execution results; 

(b) transacting in a second means instruction enable 
components specifying instruction execution se- 
quencing to produce enable events pertaining to 
further instruction execution sequencing, and en- 
able completion signals pertaining to said enable 
events; 

(c) said transacting steps including the steps of select- 
ing associations of said execution components and 
said enable components, and associations of said
execution completion signals and said enable com- 
pletion signals having associations of indices corre- 
sponding to said instruction indices; 

(d) transmitting said enable completion signals from
said second means to said first means, and transmit- 
ting said execution completion signals from said 
first means to said second means; 

(e) transacting a given execution component and 
transacting a given enable component respectively
in said first means and said second means, said
given execution component and said given enable
component having associated indices correspond- 
ing to a given instruction index; and 

(f) a transaction of a given instruction being condi-
tioned on the occurrence of completion signals
generated by the transaction of certain predecessor 
and successor instructions of said given instruction. 

26. Data processing means for effecting data process-
ing transactions according to programs of instructions,
said data processing means comprising: 

(a) execution means for effecting execution transac- 
tions and enable means for effecting enable transac- 
tions; 

(b) said instructions having instruction indices, and 
including (1) execution specifiers having execution 
indices and (2) enable specifiers having enable indi- 
ces, certain of said enable specifiers referring to a 
plurality of instruction indices; 

(c) given instruction indices corresponding to given 
associations of execution indices and enable indi- 
ces; 

(d) completions of execution transactions causing 
transmissions of done signals from said execution 
means to said enable means; 

(e) completions of enable transactions causing trans- 
missions of fire signals from said enable means to 
said execution means; 

(f) said enable means being responsive to sets of said 
done signals for performing enable transactions 
according to said enable specifiers to determine sets 
of said fire signals; 

(g) said execution means being responsive to sets of 
said fire signals to perform execution transactions 
according to said execution specifiers on functional 
sets of operators and operands, said execution 
transactions determining sets of said done signals; 

(h) the occurrence of a particular fire signal being 
conditioned on the reception by the enable means 
of at least one of said done signals, each of which 
corresponds to an enable specifier that refers to the 
instruction index associated with said particular 
fire signal. 
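
The done/fire handshake recited in claim 26 can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the claimed apparatus: the names `run`, `exec_specs`, `enable_specs` and `pending` are hypothetical, the program graph is assumed acyclic, each instruction fires once, and the acknowledgement signals from successor instructions mentioned in claim 25(f) are omitted. Each execution specifier is modeled as an operator with operand indices, and each enable specifier as the list of instruction indices to be signalled when its instruction is done.

```python
from collections import deque

def run(exec_specs, enable_specs, inputs):
    """Simulate one pass of a done/fire handshake (hypothetical model).

    exec_specs:   instruction index -> (operator, operand keys)
    enable_specs: instruction index -> indices signalled when it is done
    inputs:       externally supplied operand values, keyed by name
    """
    # Enable side: count the done signals each instruction still awaits.
    pending = {i: 0 for i in exec_specs}
    for targets in enable_specs.values():
        for t in targets:
            pending[t] += 1

    results = dict(inputs)
    # Instructions awaiting no done signal can be fired immediately.
    fire_queue = deque(i for i, n in pending.items() if n == 0)
    order = []

    while fire_queue:
        i = fire_queue.popleft()              # fire signal: enable -> execution
        op, operands = exec_specs[i]
        results[i] = op(*(results[k] for k in operands))
        order.append(i)
        for t in enable_specs.get(i, []):     # done signal: execution -> enable
            pending[t] -= 1
            if pending[t] == 0:               # all awaited done signals received
                fire_queue.append(t)
    return results, order

# Example program: D = b*b - 4.0*a*c, split into four instructions.
exec_specs = {
    0: (lambda x, y: x * y, ("b", "b")),
    1: (lambda x, y: x * y, ("a", "c")),
    2: (lambda x: 4.0 * x, (1,)),
    3: (lambda x, y: x - y, (0, 2)),
}
enable_specs = {0: [3], 1: [2], 2: [3]}
results, order = run(exec_specs, enable_specs, {"a": 1.0, "b": 5.0, "c": 6.0})
print(results[3], order)
```

In this model instruction 3 is fired only after the done signals of instructions 0 and 2 have both been received, which is the conditioning described in element (h).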

27. The data processing means of claim 26 wherein 
one of said done signals includes at least one member of 
the class consisting of condition codes and the null 
code, said condition codes being members of the class 
consisting of a few Boolean properties and logical com- 
bination thereof, said Boolean properties being proper- 
ties of the execution transaction which generated said 
one of said done signals. 

28. The data processing means of claim 26 wherein 
said execution means includes execution memory 
means, arithmetic/logic means, and execution control 
means, said execution control means effecting the trans- 
action of execution specifiers designating operands and 
operators to produce execution results pertaining to 
further operators and operands. 

29. A computer for effecting data processing transac- 
tions according to programs of instructions, said com- 
puter comprising (A) a plurality of processing element 
means, and (B) routing network means; 

(I) each of said processing element means comprising: 

(a) execution means for effecting execution transac- 
tions and enable means for effecting enable transac- 
tions; 

(b) said instructions having instruction indices, and 
including (1) execution specifiers having execution 
indices and (2) enable specifiers having enable indi- 
ces, certain of said enable specifiers referring to a 
plurality of instruction indices; 

(c) given instruction indices corresponding to given 
associations of execution indices and enable indi- 
ces; 

(d) completions of execution transactions causing 
transmissions of done signals from said execution 
means to said enable means; 

(e) completions of enable transactions causing trans- 
missions of fire signals from said enable means to 
said execution means; 

(f) said enable means being responsive to sets of said 
done signals for performing enable transactions 
according to said enable specifiers to determine 
sequence of said execution transactions; 

(g) said execution means being responsive to sequen- 
ces of said fire signals to perform execution transac- 
tions according to said execution specifiers on 
functional sets of operators and operands, said exe- 
cution transactions determining sets of said done 
signals; 

(h) the occurrence of a particular fire signal being 
conditioned on the reception by the enable means 
of at least one of said done signals, each of which 
corresponds to an enable specifier that refers to the 
instruction index associated with said particular 
fire signal. 

(II) sets of said processing element means transmit- 
ting packets of data to each other via said routing 
network means, said packets of data referring to 
said instruction execution components, said enable 
components and said operands. 
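
Element (II) of claim 29 can be pictured with a minimal Python sketch of packet delivery between processing elements. The names `Packet`, `RoutingNetwork`, `send` and `receive` are hypothetical, and the packet fields are an assumed layout chosen only for illustration; the claim does not fix a packet format or a network topology. The network is modeled simply as one first-in, first-out inbox per processing element.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    dest_pe: int            # index of the destination processing element
    instruction_index: int  # instruction the operand is addressed to
    operand: float          # operand value carried by the packet

class RoutingNetwork:
    """Toy network: any processing element can reach any other."""

    def __init__(self, n_pes):
        self.inboxes = [deque() for _ in range(n_pes)]

    def send(self, packet):
        # Deliver the packet to the inbox of its destination element.
        self.inboxes[packet.dest_pe].append(packet)

    def receive(self, pe):
        # Return the oldest waiting packet for this element, or None.
        inbox = self.inboxes[pe]
        return inbox.popleft() if inbox else None

net = RoutingNetwork(n_pes=4)
net.send(Packet(dest_pe=2, instruction_index=7, operand=3.5))
delivered = net.receive(2)
print(delivered)
```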

30. Data processing means for effecting data process- 
ing transactions according to programs of instructions, 
said data processing means comprising: 

(a) execution means for effecting execution transac- 
tions and enable means for effecting enable transac- 
tions, completions of execution transactions caus- 
ing transmission of done signals from said execu- 
tion means to said enable means, completions of 
enable transactions causing transmission of fire 
signals from said enable means to said execution 
means; 

(b) said instructions including execution specifiers 
and enable specifiers, said execution specifiers hav- 
ing execution indices and said enable specifiers 
having enable indices, given associations of execu- 
tion indices and enable indices constituting given 
instruction indices; 

(c) said enable means being responsive to sets of said 
done signals for performing enable transactions 
according to said enable specifiers, said execution 
means being responsive to sets of said fire signals to 
perform execution transactions according to said 
execution specifiers on sets of operators and oper- 
ands; 

(d) the occurrence of a particular fire signal being 
conditioned on the reception by the enable means 
of at least one of said done signals, each of which 
corresponds to an enable specifier that refers to the 
instruction index associated with said particular 
fire signal. 